(54) Methods and apparatus for retrieving audio information using content and speaker information

(57) Methods and apparatus are disclosed for retrieving audio information based on the audio content
as well as the identity of the speaker. The results of content and speaker-based audio information
retrieval methods are combined to provide references to audio information (and indirectly to video).
A query search system retrieves information responsive to a textual query containing a text string
(one or more key words), and the identity of a given speaker. An indexing system transcribes and
indexes the audio information to create time-stamped content index file(s) and speaker index file(s).
An audio retrieval system uses the generated content and speaker indexes to perform query-document
matching based on the audio content and the speaker identity. Documents satisfying the user-specified
content and speaker constraints are identified by comparing the start and end times of the document
segments in both the content and speaker domains. Documents satisfying the user-specified content and
speaker constraints are assigned a combined score that can be used in accordance with the present
invention to rank-order the identified documents returned to the user, with the best-matched segments
at the top of the list.
Description

Field of the Invention 

[0001] The present invention relates generally to information retrieval systems and, more particularly, to methods
and apparatus for retrieving multimedia information, such as audio and video information, satisfying user-specified
criteria from a database of multimedia files.

Background of the Invention 

[0002] Information retrieval systems have focused primarily on retrieving text documents.
The basic principles of text retrieval are well established and have been well documented. See, for example, G.
Salton, Automatic Text Processing, Addison-Wesley, 1989. An index is a mechanism that matches descriptions of
documents with descriptions of queries. The indexing phase describes documents as a list of words or phrases, and the
retrieval phase describes the query as a list of words or phrases. A document (or a portion thereof) is retrieved when
the document description matches the description of the query.

[0003] Data retrieval models required for multimedia objects, such as audio and video files, are quite different from
those required for text documents. There is little consensus on a standard set of features for indexing such multimedia
information. One approach for indexing an audio database is to use certain audio cues, such as applause, music or
speech. Similarly, an approach for indexing video information is to use key frames, or shot changes. For audio and video
information, text can be generated using a speech recognition system and the transcribed text can be used for indexing.

[0004] Audio information retrieval systems typically consist of two components, namely, a speech recognition system
to transcribe the audio information into text for indexing, and a text-based information retrieval system. Speech
recognition systems are typically guided by three components, namely, a vocabulary, a language model and a set of
pronunciations for each word in the vocabulary. A vocabulary is a set of words that is used by the speech recogniser to
translate speech to text. As part of the decoding process, the recogniser matches the acoustics from the speech input
to words in the vocabulary. Therefore, the vocabulary defines the words that can be transcribed. If a word that is not in
the vocabulary is to be recognised, the unrecognised word must first be added to the vocabulary.

[0005] A language model is a domain-specific database of sequences of words in the vocabulary. A set of
probabilities of the words occurring in a specific order is also required. The output of the speech recogniser will be biased
towards the high probability word sequences when the language model is operative. Thus, correct decoding is a
function of whether the user speaks a sequence of words that has a high probability within the language model. Thus, when
the user speaks an unusual sequence of words, the decoder performance will degrade. Recognition of a word is based
entirely on its pronunciation, i.e., the phonetic representation of the word. For best accuracy, domain-specific language
models must be used. The creation of such a language model requires explicit transcripts of the text along with the
audio.

[0006] Text-based information retrieval systems typically work in two phases. The first phase is an off-line indexing
phase, where relevant statistics about the textual documents are gathered to build an index. The second phase is an
on-line searching and retrieval phase, where the index is used to perform query-document matching followed by the
return of relevant documents (and additional information) to the user. During the indexing phase, the text output from
the speech recognition system is processed to build the index.

[0007] During the indexing process, the following operations are generally performed, in sequence: (i) tokenization,
(ii) part-of-speech tagging, (iii) morphological analysis, and (iv) stop-word removal using a standard stop-word list.
Tokenization detects sentence boundaries. Morphological analysis is a form of linguistic signal processing that
decomposes nouns into their roots, along with a tag to indicate the plural form. Likewise, verbs are decomposed into units
designating person, tense and mood, along with the root of the verb. For a general discussion of the indexing process, see,
for example, S. Dharanipragada et al., "Audio-Indexing for Broadcast News," in Proc. SDR97, 1997.

[0008] While such content-based audio information retrieval systems allow a user to retrieve audio files containing
one or more key words specified in a user-defined query, current audio information retrieval systems do not allow a user
to selectively retrieve relevant audio files based on the identity of the speaker. Thus, a need exists for a method and
apparatus that retrieves audio information based on the audio content as well as the identity of the speaker.

[0009] It is an object of the invention to provide a technique which alleviates the above drawbacks.

Summary of the Invention

[0010] According to the present invention we provide a method for retrieving audio information from one or more 
audio sources, said method comprising the steps of: 

receiving a user query specifying at least one content and one speaker constraint; and 

comparing said user query with a content index and a speaker index of said audio source to identify audio informa- 
tion satisfying said user query. 

[0011] Also according to the present invention we provide an audio retrieval system for retrieving audio information
from one or more audio sources, comprising: a memory that stores a content index and a speaker index of said audio
source and computer-readable code; and a processor operatively coupled to said memory, said processor configured
to implement said computer-readable code, said computer-readable code configured to: receive a user query
specifying one or more words and the identity of a speaker; and combine the results of a content-based and a speaker-based
audio information retrieval to provide references to said audio source based on the audio content and the speaker
identity.

[0012] A more complete understanding of the present invention, as well as further features and advantages of the 
present invention, will be obtained by reference to the following detailed description and drawings. 

Brief Description of the Drawings

[0013]

FIG. 1 is a block diagram of an audio retrieval system according to the present invention;

FIG. 2A is a table from the document database of the content index file(s) of FIG. 1;

FIG. 2B is a table from the document chunk index of the content index file(s) of FIG. 1;

FIG. 2C is a table from the unigram file (term frequency) of the content index file(s) of FIG. 1;

FIG. 2D is a table from the inverse document index (IDF) of the content index file(s) of FIG. 1;

FIG. 3 is a table from the speaker index file(s) of FIG. 1;

FIG. 4 illustrates a representative speaker enrollment process in accordance with the present invention;

FIG. 5 is a flow chart describing an exemplary indexing system process, performed by the audio retrieval system
of FIG. 1; and

FIG. 6 is a flow chart describing an exemplary content and speaker audio retrieval system process, performed by
the audio retrieval system of FIG. 1.

Detailed Description of Preferred Embodiments 

[0014] An audio retrieval system 100 according to the present invention is shown in FIG. 1. As discussed further 
below, the audio retrieval system 100 combines the results of two distinct methods of searching for audio material to 
provide references to audio information (and indirectly to video) based on the audio content as well as the identity of the 
speaker. Specifically, the results of a user-specified content-based retrieval, such as the results of a Web search
engine, are combined in accordance with the present invention with the results of a speaker-based retrieval.

[0015] The present invention allows a query search system to retrieve information responsive to a textual query 
containing an additional constraint, namely, the identity of a given speaker. Thus, a user query includes a text string 
containing one or more key words, and the identity of a given speaker. The present invention compares the constraints 
of the user-defined query to an indexed audio and/or video database and retrieves relevant audio/video segments
containing the specified words spoken by the given speaker.

[0016] As shown in FIG. 1, the audio retrieval system 100 of the present invention consists of two primary
components, namely, an indexing system 500 that transcribes and indexes the audio information, and an audio retrieval
system 600. As discussed further below, the indexing system 500 processes the text output from a speech recognition
system during the indexing phase to perform content indexing and speaker indexing. During the retrieval phase, the
content and speaker audio retrieval system 600 uses the content and speaker indexes generated during the indexing
phase to perform query-document matching based on the audio content and speaker identity and to return relevant
documents (and possibly additional information) to the user.

[0017] As discussed below, the speech recognition system produces transcripts with time-alignments for each word
in the transcript. Unlike a conventional information retrieval scenario, there are no distinct documents in the transcripts
and therefore they have to be artificially generated. In the illustrative embodiment, for the content-based index, the
transcribed text corresponding to each audio or video file is automatically divided into overlapping segments of a fixed
number of words, such as 100 words, and each segment is treated as a separate document. In an alternative
implementation, topic identification schemes are used to segment the files into topics. Likewise, for the speaker-based index,
the audio or video file is automatically divided into individual segments associated with a given speaker. Thus, a new
segment is created each time a new speaker speaks.

[0018] The present invention establishes the best portions of the audio as determined by the content-based
retrieval and the speaker-based retrieval. It is noted that the size of a segment in the content-based index is about the
time it takes to speak 100 words, which is approximately 30 seconds. The length of a segment in the speaker-based
index, however, is variable, being a function of the speaker change detector. Thus, the segment length cannot be
predicted. Thus, according to a feature of the present invention, the start and end times of the segments in both domains
are compared.
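
The comparison of start and end times across the two domains can be sketched as an interval-overlap test. The following is a minimal illustration only, not the claimed implementation; the function names and the tuple layout `(identifier, start, end)` are assumptions introduced here, with times in seconds.

```python
def overlap_seconds(doc_start, doc_end, seg_start, seg_end):
    """Length of the time overlap between two intervals; 0.0 if they are disjoint."""
    return max(0.0, min(doc_end, seg_end) - max(doc_start, seg_start))


def overlapping_pairs(content_docs, speaker_segs):
    """Yield (doc_id, seg_id, overlap) for every content/speaker pair that overlaps in time.

    Both inputs are lists of hypothetical (identifier, start_time, end_time) tuples.
    """
    for doc_id, d_start, d_end in content_docs:
        for seg_id, s_start, s_end in speaker_segs:
            ov = overlap_seconds(d_start, d_end, s_start, s_end)
            if ov > 0.0:
                yield doc_id, seg_id, ov


# Two 30-second content documents and two variable-length speaker segments.
content_docs = [("doc1", 0.0, 30.0), ("doc2", 15.0, 45.0)]
speaker_segs = [("segA", 0.0, 20.0), ("segB", 20.0, 60.0)]
pairs = list(overlapping_pairs(content_docs, speaker_segs))
```

A nested loop is used for clarity; with segments sorted by start time, a single merge-style pass would avoid comparing every pair.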

[0019] According to a further feature of the present invention, the extent of the overlap between the content and
speaker domains is considered. Those documents that overlap more are weighted more heavily. Generally, as
discussed further below in conjunction with FIG. 6, the combined score is computed using the following equation:

combined score = (ranked document score + (lambda * speaker segment score)) * overlap factor

[0020] The ranked document score ranks the content-based information retrieval, for example, using the Okapi
equation, discussed below. The ranked document score is a function of the query terms, and is thus calculated at
retrieval time. The speaker segment score is a distance measure indicating the proximity between the speaker segment
and the enrolled speaker information and can be calculated during the indexing phase. Lambda is a variable that
records the degree of confidence in the speaker identity process, and is a number between zero and one. The overlap
factor penalises segments that do not overlap completely, and is a number between zero and one. The combined score
can be used to rank-order the identified documents returned to the user, with the best-matched segments at the top of
the list.
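
The combined score and the resulting rank order can be sketched as follows. This is a hedged illustration: the text states only that the overlap factor lies between zero and one, so computing it as the fraction of the content document covered by the speaker segment is an assumption made here, as are all function names and sample values.

```python
def overlap_factor(doc_start, doc_end, seg_start, seg_end):
    """ASSUMPTION: fraction of the content document that the speaker segment covers."""
    overlap = max(0.0, min(doc_end, seg_end) - max(doc_start, seg_start))
    length = doc_end - doc_start
    return overlap / length if length > 0 else 0.0


def combined_score(ranked_document_score, speaker_segment_score, lam, factor):
    """(ranked document score + lambda * speaker segment score) * overlap factor."""
    return (ranked_document_score + lam * speaker_segment_score) * factor


def rank(candidates, lam):
    """Sort hypothetical (doc_id, doc_score, speaker_score, factor) tuples, best first."""
    scored = [(doc_id, combined_score(d, s, lam, f)) for doc_id, d, s, f in candidates]
    return sorted(scored, key=lambda item: item[1], reverse=True)


candidates = [
    ("doc1", 2.0, 0.5, 1.0),  # complete overlap
    ("doc2", 3.0, 0.5, 0.5),  # better content match, but only half overlapped
    ("doc3", 1.0, 1.0, 1.0),
]
ranking = rank(candidates, lam=0.5)
```

In this example the partially overlapping "doc2" falls below "doc1" despite its higher content score, which is the penalising effect described for the overlap factor.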

[0021] FIG. 1 is a block diagram showing the architecture of an illustrative audio retrieval system 100 in accordance
with the present invention. The audio retrieval system 100 may be embodied as a general purpose computing system,
such as the general purpose computing system shown in FIG. 1. The audio retrieval system 100 includes a processor
110 and related memory, such as a data storage device 120, which may be distributed or local. The processor 110 may
be embodied as a single processor, or a number of local or distributed processors operating in parallel. The data
storage device 120 and/or a read only memory (ROM) are operable to store one or more instructions, which the processor
110 is operable to retrieve, interpret and execute.

[0022] The data storage device 120 preferably includes an audio corpus database 150 for storing one or more
audio or video files (or both) that can be indexed and retrieved in accordance with the present invention. In addition, the
data storage device 120 includes one or more content index file(s) 200 and one or more speaker index file(s) 300,
discussed below in conjunction with FIGS. 2 and 3, respectively. Generally, as discussed below in conjunction with FIGS.
2A through 2D, the content index file(s) 200 includes a document database 210 (FIG. 2A), a document chunk index 240
(FIG. 2B), a unigram file (term frequency) 260 (FIG. 2C) and an inverse document index (IDF) 275 (FIG. 2D). The
content index file(s) 200 are generated in conjunction with a speech recognition system during an indexing phase and
describe the audio (or video) documents as a list of words or phrases, together with additional indexing information.
The speaker index file(s) 300 is generated in conjunction with a speaker identification system during the indexing phase
and provides a speaker label for each segment of an audio file. Thereafter, during the retrieval phase, the content index
file(s) 200 and speaker index file(s) 300 are accessed and a document is retrieved if the document description in the
content index file(s) 200 matches the description of the user-specified query and the speaker identity indicated by the
speaker label in the speaker index file(s) 300 matches the designated speaker identity.

[0023] In addition, the data storage device 120 includes the program code necessary to configure the processor
110 as an indexing system 500, discussed further below in conjunction with FIG. 5, and a content and speaker audio
retrieval system 600, discussed further below in conjunction with FIG. 6. As previously indicated, the indexing system
500 analyses one or more audio files in the audio corpus database 150 and produces the corresponding content index
file(s) 200 and speaker index file(s) 300. The content and speaker audio retrieval system 600 accesses the content
index file(s) 200 and speaker index file(s) 300 in response to a user-specified query to perform query-document
matching based on the audio content and speaker identity and to return relevant documents to the user.

INDEX FILES 

[0024] As previously indicated, the audio sample is initially transcribed, for example, using a speech recognition
system, to produce a textual version of the audio information. Thereafter, the indexing system 500 analyses the textual
version of the audio file(s) to produce the corresponding content index file(s) 200 and speaker index file(s) 300.

[0025] As previously indicated, the content index file(s) 200 includes a document database 210 (FIG. 2A), a
document chunk index 240 (FIG. 2B), a unigram file (term frequency) 260 (FIG. 2C) and an inverse document index (IDF)
275 (FIG. 2D). Generally, the content index files 200 store information describing the audio (or video) documents as a
list of words or phrases, together with additional indexing information. In the illustrative embodiment, the content index
file(s) 200 records, among other things, statistics required by the Okapi equation.

[0026] The document database 210 (FIG. 2A) maintains a plurality of records, such as records 211 through 214,
each associated with a different 100 word document chunk in the illustrative embodiment. In one implementation, there
is a 50 word overlap between documents. For each document chunk identified in field 220, the document database 210
indicates the start and end time of the chunk in fields 222 and 224, respectively, as well as the document length in field
226. Finally, for each document chunk, the document database 210 provides a pointer to a corresponding document
chunk index 240, that indexes the document chunk. Although documents have a fixed length of 100 words in the
illustrative embodiment, the length in bytes can vary. As discussed below, the document length (in bytes) is used to
normalise the scoring of an information retrieval.
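
The division of a time-stamped transcript into overlapping 100 word documents can be sketched as below. This is an illustrative sketch under stated assumptions: the `DocumentRecord` type, its field names, and the availability of an end time for each word are hypothetical and are not taken from FIG. 2A.

```python
from dataclasses import dataclass


@dataclass
class DocumentRecord:
    """Hypothetical stand-in for one record of the document database."""
    doc_id: int
    start_time: float
    end_time: float
    length_bytes: int
    words: list  # (word, start_time) pairs, playing the role of the chunk index


def chunk_transcript(timed_words, chunk_size=100, overlap=50):
    """Split (word, start_time, end_time) tuples into overlapping fixed-size documents."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    records = []
    for doc_id, first in enumerate(range(0, len(timed_words), step)):
        chunk = timed_words[first:first + chunk_size]
        text = " ".join(word for word, _, _ in chunk)
        records.append(DocumentRecord(
            doc_id=doc_id,
            start_time=chunk[0][1],          # start time of the first word
            end_time=chunk[-1][2],           # end time of the last word
            length_bytes=len(text.encode("utf-8")),
            words=[(word, start) for word, start, _ in chunk],
        ))
        if first + chunk_size >= len(timed_words):
            break                            # the last chunk reached the end of the transcript
    return records


# 250 synthetic words, one per second.
timed_words = [(f"w{i}", float(i), float(i + 1)) for i in range(250)]
docs = chunk_transcript(timed_words)
```

Because the word count per document is fixed while word lengths differ, `length_bytes` varies between documents, which is why it is recorded separately for score normalisation.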

[0027] The document chunk index 240 (FIG. 2B) maintains a plurality of records, such as records 241 through 244, 
each associated with a different word in the corresponding document chunk. Thus, in the illustrative implementation, 
there are 100 entries in each document chunk index 240. For each word string (from the document chunk) identified in 
field 250, the document chunk index 240 indicates the start time of the word in field 255. 

[0028] A unigram file (term frequency) 260 (FIG. 2C) is associated with each document, and indicates the number
of times each word occurs in the document. The unigram file 260 maintains a plurality of records, such as records 261
through 264, each associated with a different word appearing in the document. For each word string identified in field
265, the unigram file 260 indicates the number of times the word appears in the document in field 270.

[0029] The inverse document index 275 (FIG. 2D) indicates the number of times each word appears in the
collection of documents (the audio corpus), and is used to rank the relevance of the current document amongst all documents
in which the word occurs. The inverse document index 275 maintains a plurality of records, such as records 276 through
279, each associated with a different word in the vocabulary. For each word identified by the vocabulary identifier in field
280, the inverse document index 275 indicates the word string in field 285, the inverse document frequency (IDF) in field
290 and a list of the documents in which the word appears in field 295. The list of documents in field 295 permits a
determination of whether the word appears in any documents without actually searching.
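
The unigram file and the inverse document index can be sketched with ordinary dictionaries. This is a minimal sketch, not the file layout of FIGS. 2C and 2D: the function name and dictionary keys are hypothetical, and the IDF is computed here as log(N / n), a common textbook form chosen as an assumption because the text does not give the formula at this point.

```python
import math
from collections import Counter, defaultdict


def build_content_index(documents):
    """Build per-document term frequencies and a corpus-wide inverse document index.

    documents: dict mapping a document identifier to its list of words.
    """
    # Unigram file: one term-frequency table per document.
    unigrams = {doc_id: Counter(words) for doc_id, words in documents.items()}

    # Posting lists: for each word, the documents in which it appears.
    postings = defaultdict(list)
    for doc_id in sorted(documents):
        for word in unigrams[doc_id]:
            postings[word].append(doc_id)

    n_docs = len(documents)
    inverse_index = {
        word: {
            "idf": math.log(n_docs / len(doc_ids)),  # ASSUMED IDF form: log(N / n)
            "documents": doc_ids,
        }
        for word, doc_ids in postings.items()
    }
    return unigrams, inverse_index


documents = {1: ["audio", "index", "audio"], 2: ["speaker", "index"]}
unigrams, inverse_index = build_content_index(documents)
```

Looking up a query word in `inverse_index` shows immediately whether any document contains it, without scanning the documents themselves.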

[0030] As previously indicated, the speaker index file(s) 300, shown in FIG. 3, provides a speaker label for each
segment of an audio file. The speaker index file(s) 300 maintains a plurality of records, such as records 305 through
312, each associated with a different segment of an audio file. Each segment of speech is associated with a different
speaker. For each segment identified in field 325, the speaker index file(s) 300 identifies the corresponding speaker in
field 330, and the corresponding audio or video file containing the segment in field 335. In addition, the speaker index
file(s) 300 also indicates the start and end time of the segment (as offsets from the start of the file) in fields 340 and
345, respectively. The speaker index file(s) 300 indicates a score (distance measure) in field 350 indicating the
proximity between the speaker segment and the enrolled speaker information, as discussed below in conjunction with FIG. 5.

SPEAKER REGISTRATION PROCESS

[0031] FIG. 4 illustrates a known process used to register or enrol speakers. As shown in FIG. 4, for each registered
speaker, the name of the speaker is provided to a speaker enrolment process 410, together with a speaker training file,
such as a pulse-code modulated (PCM) file. The speaker enrolment process 410 analyses the speaker training file, and
creates an entry for each speaker in a speaker database 420. The process of adding speakers' voice samples to the
speaker database 420 is called enrolment. The enrolment process is offline and the audio indexing system assumes
such a database exists for all speakers of interest. About a minute's worth of audio is generally required from each
speaker from multiple channels and microphones encompassing multiple acoustic conditions. The training data or
database of enrolled speakers is stored using a hierarchical structure so that accessing the models is optimised for efficient
recognition and retrieval.

INDEXING PROCESS

[0032] As previously indicated, during the indexing phase, the indexing system 500, shown in FIG. 5, processes the
text output from the speech recognition system to perform content indexing and speaker indexing. As shown in FIG. 5,
the content indexing and speaker indexing are implemented along two parallel processing branches, with content
indexing being performed in steps 510 through 535, and speaker indexing being performed during step 510 and the
steps beginning at step 550. It is noted, however, that the content indexing and speaker indexing can be performed
sequentially, as would be apparent to a person of ordinary skill in the art.

[0033] As an initial step for both content indexing and speaker indexing, cepstral features are extracted from the
audio files during step 510 in a known manner. Generally, step 510 changes the domain of the audio files to the
frequency domain, reduces the dynamic range and performs an inverse transform to return the signal to the time domain.

Content -Indexing 

[0034] The audio information is then applied to a transcription engine, such as the ViaVoice™ speech recognition
system, commercially available from IBM Corporation of Armonk, NY, during step 515 to produce a transcript of
time-stamped words. Thereafter, the time-stamped words are collected into document chunks of a fixed length, such as
100 words in the illustrative embodiment, during step 520.

[0035] The statistics required for the content index file(s) 200 are extracted from the audio files during step 530.
As indicated above, the indexing operations include: (i) tokenization, (ii) part-of-speech tagging, (iii) morphological
analysis, and (iv) stop-word removal using a standard stop-word list. Tokenization detects sentence boundaries.
Morphological analysis is a form of linguistic signal processing that decomposes nouns into their roots, along with a tag to
indicate the plural form. Likewise, verbs are decomposed into units designating person, tense and mood, along with the
root of the verb.

[0036] During step 530, the indexing system 500 obtains the statistics required by the Okapi equation. For each
word identified in the audio field, the following information is obtained: the term frequency (number of times the word
appears in a given document); the inverse document frequency (IDF) (indicating the number of documents in which the
word occurs); the document length (for normalisation); and a set of chain linked pointers to each document containing
the word.

[0037] The information obtained during step 530 is stored in a content index file(s) 200 during step 535, or if a
content index file(s) 200 already exists, the information is updated.

Speaker - Indexing 

[0038] As discussed further below, the speaker-based information retrieval system consists of two components: (1)
an acoustic-change detection system (often referred to as speaker segmentation), and (2) a speaker-dependent,
language-independent, text-independent speaker recognition system. To automate the speaker identification process, the
boundaries (turns) between non-homogeneous speech portions must be detected during step 550. Each homogeneous
segment should correspond to the speech of a single speaker. Once delineated, each segment can be classified
as having been spoken by a particular speaker (assuming the segment meets the minimum segment length
requirement).

[0039] The segmentation in the illustrative embodiment is based on the well-known Bayesian Information Criterion
(BIC). The input audio stream can be modelled as a Gaussian process on the cepstral space. BIC is a maximum likelihood
approach to detect (speaker) turns of a Gaussian process. The problem of model identification is to choose one from
among a set of candidate models to describe a given data set. It is assumed the frames (10 ms) derived from the input
audio signal are independent and result from a single Gaussian process. In order to detect if there is a speech change
in a window of N feature vectors after the frame $i$, $1 \le i < N$, two models are built. The first model represents the
entire window by one Gaussian, characterised by its mean and full covariance $\{\mu, \Sigma\}$. The second model
represents the first part of the window, up to frame $i$, with a first Gaussian $\{\mu_1, \Sigma_1\}$, and the second
part of the window with another Gaussian $\{\mu_2, \Sigma_2\}$. The criterion is then expressed as:

\[ \Delta BIC(i) = -R(i) + \lambda P, \]

where

\[ R(i) = \frac{N}{2}\log|\Sigma| - \frac{N_1}{2}\log|\Sigma_1| - \frac{N_2}{2}\log|\Sigma_2| \]

and

\[ P = \frac{1}{2}\left(d + \frac{d(d+1)}{2}\right)\log N \]

is the penalty associated to the window, $N_1 = i$ is the number of frames of the first part of the window, and
$N_2 = (N - i)$ is the number of frames of the second part; $d$ is the dimension of the frames. Therefore, P reflects the
complexity of the models, as $d + \frac{d(d+1)}{2}$ is the number of parameters used to represent the Gaussians.

[0040] $\Delta BIC < 0$ implies, taking the penalty into account, that the model splitting the window into two Gaussians is
more likely than the model representing the entire window with only a single Gaussian. The BIC therefore behaves like
a thresholded-likelihood ratio criterion, where the threshold is not empirically tuned but has a theoretical foundation.
This criterion is robust and requires no prior training.
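
The criterion can be sketched for a single candidate boundary as follows. This is a minimal pure-Python sketch, assuming the standard form of the criterion in which the split model is favoured when the likelihood gain R(i) exceeds the penalty P; the weight `lam = 1.0`, the small `ridge` term added to keep the covariance invertible, and all function names are assumptions introduced for illustration.

```python
import math


def covariance(frames):
    """Maximum-likelihood full covariance matrix of a list of d-dimensional frames."""
    n, d = len(frames), len(frames[0])
    mean = [sum(f[k] for f in frames) / n for k in range(d)]
    return [[sum((f[a] - mean[a]) * (f[b] - mean[b]) for f in frames) / n
             for b in range(d)] for a in range(d)]


def log_det(matrix, ridge=1e-6):
    """log|matrix| via Cholesky factorisation; ridge is an assumed regulariser."""
    d = len(matrix)
    lower = [[0.0] * d for _ in range(d)]
    total = 0.0
    for a in range(d):
        for b in range(a + 1):
            s = sum(lower[a][k] * lower[b][k] for k in range(b))
            if a == b:
                lower[a][a] = math.sqrt(matrix[a][a] + ridge - s)
                total += 2.0 * math.log(lower[a][a])
            else:
                lower[a][b] = (matrix[a][b] - s) / lower[b][b]
    return total


def delta_bic(frames, i, lam=1.0):
    """Delta BIC for a candidate change after frame i (1 <= i < N); negative favours a split."""
    n, d = len(frames), len(frames[0])
    n1, n2 = i, n - i
    r = ((n / 2.0) * log_det(covariance(frames))
         - (n1 / 2.0) * log_det(covariance(frames[:i]))
         - (n2 / 2.0) * log_det(covariance(frames[i:])))
    penalty = 0.5 * (d + d * (d + 1) / 2.0) * math.log(n)
    return -r + lam * penalty


# Synthetic 2-dimensional frames: one homogeneous window, one with a change at frame 20.
pattern = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
homogeneous = pattern * 10
changed = pattern * 5 + [[x + 10.0, y + 10.0] for x, y in pattern * 5]
```

On the synthetic data, `delta_bic(changed, 20)` is negative and `delta_bic(homogeneous, 20)` is positive, matching the interpretation given for the sign of the criterion.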

[0041] In the illustrative implementation, the BIC algorithm has been implemented to make it fast without impairing
the accuracy. The feature vectors used are simply mel-cepstra frames using 24 dimensions. No other processing is
done on these vectors. The algorithm works on a window-by-window basis, and in each window, a few frames are tested
to check whether they are BIC-prescribed segment boundaries. If no segment boundary is found (positive $\Delta BIC$), then
the window size is increased. Otherwise, the old window location is recorded, which also corresponds to the start of a
new window (with original size).

[0042] A detailed set of steps for a BIC implementation is set forth below. The BIC computations are not performed
for each frame of the window for obvious practical reasons. Instead, a frame resolution r is used, which splits the
window into M = N/r subsegments. Out of the resulting (M - 1) BIC tests, the one that leads to the most negative $\Delta BIC$ is
selected. If such a negative value exists, the detection window is reset to its minimal size, and a refinement of the point
detected is performed, with a better resolution. These refinement steps increase the total number of computations and
impact the speed-performance of this algorithm. Hence, these should be tailored to the particular user environment,
real-time or offline.
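
The selection of the most negative value among the (M - 1) tests can be sketched as below. This is an illustration only: `delta_bic_at` is a hypothetical callable that returns the criterion value for a candidate boundary, and the refinement pass at a better resolution is omitted.

```python
def best_change_point(window_size, resolution, delta_bic_at):
    """Test the M - 1 subsegment boundaries and return the most negative one, if any.

    Returns (frame_index, value) for the most negative value, or None when no test
    is negative (in which case the window would be enlarged).
    """
    m = window_size // resolution            # number of subsegments, M = N / r
    best = None
    for k in range(1, m):                    # the M - 1 interior boundaries
        i = k * resolution
        value = delta_bic_at(i)
        if value < 0 and (best is None or value < best[1]):
            best = (i, value)
    return best


# Hypothetical criterion values with a single dip at frame 30.
found = best_change_point(60, 10, lambda i: abs(i - 30) - 5)
not_found = best_change_point(60, 10, lambda i: 1.0)
```

A larger `resolution` reduces the number of tests per window at the cost of a coarser boundary estimate, which is the trade-off the refinement steps address.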

[0043] If no negative value is found, the window size is increased from $N_{i-1}$ to $N_i$ frames using the following rule:
$N_i = N_{i-1} + \Delta N_i$, with $\Delta N_i$ also increasing when no change is found: $N_i - N_{i-1} = 2(N_{i-1} - N_{i-2})$.
This speeds up the algorithm in homogeneous segments of the speech signal. In order not to increase the error rate
though, $\Delta N_i$ has an upper bound. When the detection window gets too big, the number of BIC computations is
further reduced. If more than $M_{max}$ subsegments are present, only ($M_{max}$ - 1) BIC computations will be
performed, skipping the first.
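
The window growth rule, with the increment doubling until it reaches its upper bound, can be sketched as follows. The function name and all numeric values are hypothetical; the text gives no concrete sizes.

```python
def window_sizes(initial_size, initial_increment, max_increment, count):
    """Successive detection-window sizes while no change point is found.

    Each increment is twice the previous one, capped at max_increment.
    """
    sizes = [initial_size]
    increment = initial_increment
    for _ in range(count - 1):
        sizes.append(sizes[-1] + increment)
        increment = min(2 * increment, max_increment)
    return sizes


# Hypothetical values: 100-frame window, first growth of 50 frames, growth capped at 300.
sizes = window_sizes(100, 50, 300, 5)
```

With these values the increments are 50, 100, 200 and then 300, the last one being limited by the cap rather than doubled to 400.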

[0044] During step 555, the results of step 550 are used to analyse the features produced during step 510 and to
generate segment utterances, comprised of chunks of speech by a single speaker. The segment utterances are applied
during step 560 to a speaker identification system. For a discussion of a speaker identification system, see, for example,
H.S.M. Beigi et al., "IBM Model-Based and Frame-By-Frame Speaker-Recognition," in Proc. of Speaker Recognition
and Its Commercial and Forensic Applications, Avignon, France (1998). Generally, the speaker identification system
compares the segment utterances to the speaker database 420 (FIG. 4) and finds the "closest" speaker.

[0045] The speaker identification system has two different implementations, a model-based approach and a frame- 
based approach with concomitant merits and demerits. The engine is both text and language independent to facilitate 
live audio indexing of material such as broadcast news. 

Speaker Identification - - The Model-Based Approach

[0046] To create a set of training models for the population of speakers in the database, a model $M_i$ for the $i$-th
speaker based on a sequence of M frames of speech, with the d-dimensional feature vector $\{f_m\}_{m=1,\ldots,M}$, is
computed. These models are stored in terms of their statistical parameters, such as
$\{\mu_{i,j}, \Sigma_{i,j}, C_{i,j}\}_{j=1,\ldots,n_i}$, consisting of the Mean vector, the Covariance matrix, and the
Counts, for the case when a Gaussian distribution is selected. Each speaker, $i$, may end up with a model consisting of
$n_i$ distributions.

[0047] Using the distance measure proposed in H.S.M. Beigi et al., "A Distance Measure Between Collections of
Distributions and Its Application to Speaker Recognition," Proc. ICASSP98, Seattle, WA, 1998, for comparing two such
models, a hierarchical structure is created to devise a speaker recognition system with many different capabilities
including speaker identification (attest a claim), speaker classification (assigning a speaker), speaker verification
(second pass to confirm classification by comparing label with a "cohort" set of speakers whose characteristics match those
of the labelled speaker), and speaker clustering.

[0048] The distance measure devised for speaker recognition permits computation of an acceptable distance
between two models with a different number of distributions $n_i$. Comparing two speakers solely based on the parametric
representation of their models obviates the need to carry the features around, making the task of comparing two
speakers much less computationally intensive. A shortcoming of this distance measure for the recognition stage, however, is
that the entire speech segment has to be used to build the model of the test individual (claimant) before computation of
the comparison can begin. The frame-by-frame approach alleviates this problem.

Speaker Identification - - The Frame-By-Frame Approach

[0049] Let $M_i$ be the model corresponding to the $i$th enrolled speaker. $M_i$ is entirely defined by the parameter set,

$\{\mu_{i,j}, \Sigma_{i,j}, p_{i,j}\}_{j=1,\ldots,n_i}$,

consisting of the mean vector, covariance matrix, and mixture weight for each of the $n_i$ components of speaker $i$'s
Gaussian Mixture Model (GMM). These models are created using training data consisting of a sequence of
$M$ frames of speech, with the $d$-dimensional feature vector,

$\{f_m\}_{m=1,\ldots,M}$,

as described in the previous section. If the size of the speaker population is $N_p$, then the set of the model universe
is $\{M_i\}_{i=1,\ldots,N_p}$. The fundamental goal is to find the $i$ such that $M_i$ best explains the test data,
represented as a sequence of $N$ frames,

$\{f_n\}_{n=1,\ldots,N}$,

or to make a decision that none of the models describes the data adequately. The following frame-based weighted
likelihood distance measure, $d_{i,n}$, is used in making the decision:



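$$d_{i,n} = -\log\left[\sum_{j=1}^{n_i} p_{i,j}\, p(f_n \mid j\text{th component of } M_i)\right],$$

where, using a Normal representation,

$$p(f_n \mid \cdot) = \frac{1}{(2\pi)^{d/2}\,|\Sigma_{i,j}|^{1/2}} \exp\left(-\tfrac{1}{2}(f_n - \mu_{i,j})^{T}\,\Sigma_{i,j}^{-1}\,(f_n - \mu_{i,j})\right).$$

The total distance, $D_i$, of model $M_i$ from the test data is then taken to be the sum of all the distances over the
total number of test frames, $D_i = \sum_{n=1}^{N} d_{i,n}$.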
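The frame-by-frame distance described above can be illustrated with a minimal Python sketch. It is not the implementation of the described system: it assumes diagonal covariance matrices (so that each component is given by a list of variances), represents a model as a hypothetical list of `(weight, mean, variances)` tuples, and floors the likelihood to avoid taking the logarithm of zero.

```python
import math

def component_density(frame, mean, variances):
    """Density of one Gaussian component at `frame` (diagonal covariance assumed)."""
    log_p = 0.0
    for x, m, v in zip(frame, mean, variances):
        log_p += -0.5 * (math.log(2.0 * math.pi * v) + (x - m) ** 2 / v)
    return math.exp(log_p)

def frame_distance(frame, model):
    """d_{i,n}: negative log of the weighted sum of the component densities."""
    likelihood = sum(w * component_density(frame, mean, var)
                     for w, mean, var in model)
    # Floor the likelihood so that a frame far from every component
    # gives a large finite distance instead of a math domain error.
    return -math.log(max(likelihood, 1e-300))

def total_distance(frames, model):
    """D_i: sum of the per-frame distances over all test frames."""
    return sum(frame_distance(frame, model) for frame in frames)
```

Because each frame contributes its distance as soon as it arrives, the running sum can be updated incrementally without first building a model of the test speaker.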

[0050] For classification, the model with the smallest distance to that of the speech segment is chosen. By comparing
the smallest distance to that of a background model, one could provide a method to indicate that none of the original
models match very well. Alternatively, a voting technique may be used for computing the total distance.
[0051] For verification, a predetermined set of members that form the cohort of the labeled speaker is augmented
with a variety of background models. Using this set as the model universe, the test data is verified by testing if the
claimant's model has the smallest distance; otherwise, it is rejected.
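The classification and verification decisions can be sketched as follows. This is an illustrative sketch only: it assumes the total distances have already been computed and are held in a dictionary keyed by model label, and the function names and the `"background"` label are hypothetical.

```python
def classify(distances, background_label="background"):
    """Return the label of the closest model, or None when the background
    model is closest, i.e. no enrolled model matches well."""
    best = min(distances, key=distances.get)
    return None if best == background_label else best

def verify(claimed, distances, cohort, background_labels):
    """Accept the claim only if the claimant's model has the smallest
    distance within the universe of cohort and background models."""
    universe = {claimed, *cohort, *background_labels}
    candidates = {label: distances[label] for label in universe}
    return min(candidates, key=candidates.get) == claimed
```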

[0052] This distance measure is not used in training since the frames of speech would have to be retained for
computing the distances between the speakers. The training is done, therefore, using the method for the model-based
technique discussed above.

[0053] The index file for speaker-based retrieval is built by taking a second pass over the results of speaker
classification and verification during step 565. If the speaker identification is verified during step 565, then the
speaker label is assigned to the segment during step 570. As previously indicated, each classification result is
accompanied by a score indicating the distance from the original enrolled speaker model to the audio test segment, the
start and end times of the segment relative to the beginning of the audio clip concerned, and a label (name of the
speaker supplied during enrolment). In addition, for any given audio clip, all the segments assigned to the same
(speaker) label are gathered. They are then sorted by their scores and normalised by the segment with the best score.
For every new audio clip processed by the system and added to the index, all the labelled segments are again sorted and
re-normalised. This information is stored in a speaker index file(s) 300 during step 575, or if a speaker index file(s)
300 already exists, the information is updated.
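The gathering, sorting and normalising of labelled segments might be sketched as below. The record layout and function name are hypothetical, and the sketch assumes that scores are positive distances (smaller is better) and that normalising means dividing each score by the best score for that label; the text does not specify either point.

```python
from collections import defaultdict

def build_speaker_index(results):
    """Group classification results by speaker label, sort each group by
    score, and normalise by the best (smallest) score in the group.

    `results` is an iterable of dicts with the keys
    "clip", "label", "start", "end" and "score".
    """
    by_label = defaultdict(list)
    for result in results:
        by_label[result["label"]].append(dict(result))  # copy, keep input intact

    index = {}
    for label, segments in by_label.items():
        segments.sort(key=lambda s: s["score"])  # best (smallest distance) first
        best = segments[0]["score"]
        for segment in segments:
            segment["normalised"] = segment["score"] / best
        index[label] = segments
    return index
```

Rebuilding the index from the full list of results whenever a new clip is added re-sorts and re-normalises every label, which mirrors the update described above.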

RETRIEVAL PROCESS 


[0054] As previously indicated, during the retrieval phase, the content and speaker audio retrieval system 600,
shown in FIG. 6, uses the content and speaker indexes generated during the indexing phase to perform query-document
matching based on the audio content and speaker identity and to return relevant documents (and possibly additional
information) to the user. Generally, retrieval can be performed using two distinct, non-overlapping modules, one
for content-based and the other for speaker-based retrieval. The two modules can be programmed to run concurrently
using threads or processes since they are completely independent. In the illustrative implementation, both modules run
sequentially.

[0055] At retrieval time, the content and speaker audio retrieval system 600 loads the same vocabularies, tag
dictionaries, morphological tables and token tables that were used in indexing during steps 610 and 20. The appropriate
content index file(s) 200 and speaker index file(s) 300 are loaded into memory during step 620. A test is performed
during step 625 until a query is received.

[0056] The query string is received and processed during step 630. In response to a received textual query, the
query string is compared during step 635 against the content index file(s) 200 to compute the most relevant
document(s) using an objective ranking function (ranked document score). The ranked document score that is used in the
ranking of these documents is also recorded for subsequent computing of the combined scores in accordance with the
present invention (step 645).

[0057] The following version of the Okapi formula, for computing the ranked document score between a document
$d$ and a query $q$, is used:

$$S(d,q) = \sum_{k=1}^{Q} c_q(q_k)\,\frac{c_d(q_k)}{a_1 + a_2\,\frac{l_d}{\bar{l}} + c_d(q_k)}\,\mathrm{idf}(q_k)$$

Here, $q_k$ is the $k$th term in the query, $Q$ is the number of terms in the query, $c_q(q_k)$ and $c_d(q_k)$ are the
counts of the $k$th term in the query and document respectively, $l_d$ is the length of the document, $\bar{l}$ is the
average length of the documents in the collection, and $\mathrm{idf}(q_k)$ is the inverse document frequency for the
term $q_k$, which is given by:

$$\mathrm{idf}(q_k) = \log\left(\frac{N - n(q_k) + 0.5}{n(q_k) + 0.5}\right)$$

where $N$ is the total number of documents and $n(q_k)$ is the number of documents that contain the term $q_k$. The
inverse document frequency term thus favours terms that are rare among documents. (For unigrams, $a_1 = 0.5$ and
$a_2 = 1.5$.) Clearly, the idf can be pre-calculated and stored, as can most of the elements of the scoring function
above, except for the items relating to the query.
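As an illustration of the ranked document score, the following Python sketch scores one tokenised document against a tokenised query. It is a sketch under stated assumptions rather than the described implementation: the function and parameter names are hypothetical, the inverse document frequency is taken in the common Okapi form log((N - n + 0.5) / (n + 0.5)), and the sum runs over the distinct query terms, with repeated terms handled by the query count.

```python
import math
from collections import Counter

def idf(term, doc_freq, num_docs):
    """Inverse document frequency (common Okapi form, assumed here)."""
    n = doc_freq.get(term, 0)
    return math.log((num_docs - n + 0.5) / (n + 0.5))

def okapi_score(query_terms, doc_terms, avg_len, doc_freq, num_docs,
                a1=0.5, a2=1.5):
    """Ranked document score of one document for one query."""
    c_q = Counter(query_terms)   # counts of each term in the query
    c_d = Counter(doc_terms)     # counts of each term in the document
    l_d = len(doc_terms)         # document length in terms
    score = 0.0
    for term, q_count in c_q.items():
        d_count = c_d[term]
        if d_count == 0:
            continue             # term absent from the document
        tf = d_count / (a1 + a2 * l_d / avg_len + d_count)
        score += q_count * tf * idf(term, doc_freq, num_docs)
    return score
```

The values returned by `idf` depend only on the collection, so they can be computed once at indexing time and looked up at query time.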

[0058] Each query is matched against all the documents in the collection and the documents are ranked according
to the computed score from the Okapi formula indicated above. The ranked document score takes into account the
number of times each query term occurs in the document, normalised with respect to the length of the document. This
normalisation removes the bias that generally favours longer documents, since longer documents are more likely to have
more instances of any given word. This function also favours terms that are specific to a document and rare across
other documents. (If a second pass is used, the documents would be re-ranked by training another model for
documents, using the top-ranked documents from the first pass as training data.)
[0059] Thereafter, the identified documents (or a subset thereof) are analysed during step 640 to determine if the
speaker identified in the speaker index file(s) 300 matches the speaker specified by the user in the query. Specifically,
the ranked documents satisfying the content-based query are compared with those documents satisfying the
speaker-based query to identify documents with overlapping start and end times. A single segment from
speaker retrieval may overlap with multiple segments from text retrieval.
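The comparison of start and end times in the two domains can be sketched as an interval-overlap test. The record layout below (dicts with hypothetical `clip`, `start` and `end` keys, times in seconds) is an assumption made for illustration.

```python
def overlap_seconds(a, b):
    """Length in seconds of the time interval shared by segments a and b."""
    return max(0.0, min(a["end"], b["end"]) - max(a["start"], b["start"]))

def find_overlaps(content_docs, speaker_segments):
    """Return (document, speaker segment, overlap) triples for every pair
    from the same clip whose time spans overlap."""
    pairs = []
    for segment in speaker_segments:
        for doc in content_docs:
            if doc["clip"] != segment["clip"]:
                continue
            shared = overlap_seconds(doc, segment)
            if shared > 0.0:
                pairs.append((doc, segment, shared))
    return pairs
```

A speaker segment that spans several content documents produces one triple per document, which reflects the one-to-many overlap noted above.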

[0060] The combined score for any overlapping documents is computed during step 645 as follows:

Combined score = (ranked document score + (lambda * speaker segment score))

in the manner described above. All of the scored documents are then ranked and normalised with the most relevant
[...] returned to the user. Thus, a list of start and end times of the N best-matched segments, together with the
match-scores, and the matched words that contributed to the relevance score, are returned during step 650. The default
start time of each combined result is the same as the start time for the corresponding document from the content-based
search. (The other choice is to use the start time of the speaker segment.) The end time is set to the end of the
speaker segment (simply to let the speaker finish his statement). However, for usability reasons, the segment can be
truncated at a fixed duration, such as 60 seconds, i.e., two times as long as the average document length.
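A minimal sketch of the combination step is given below. It assumes that both scores have already been scaled so that larger values indicate better matches, and the value of lambda, the record layout and the function name are hypothetical; the text does not fix any of these.

```python
def combine(matches, lam=0.5, max_duration=60.0, top_n=10):
    """Score, time-stamp and rank overlapping (document, speaker segment) pairs.

    Each document and segment is a dict with "start", "end" and "score".
    """
    results = []
    for doc, segment in matches:
        score = doc["score"] + lam * segment["score"]
        start = doc["start"]  # default: start of the content document
        # End with the speaker segment, truncated at a fixed duration.
        end = min(segment["end"], start + max_duration)
        results.append({"start": start, "end": end, "score": score})
    results.sort(key=lambda r: r["score"], reverse=True)  # best match first
    return results[:top_n]
```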

USER INTERFACE 

[0062] The illustrative user interface is capable of showing all the relevant information for each of the N selections
returned by the retrieval engine, and on further selection uses a media handler component, implemented using the Java
Media Filter, to display MPEG-1 video via a VCR-like interface. The Java application is responsible for locating the
video files (which can be on a server if the PC is networked), and then uses information gathered during retrieval to
embellish the results, such as displaying the retrieved document, associated information such as media file name, start
time, end time, rank, normalised score, a graphic view of where in the media file the retrieved segment lies,
highlighting the query words (and other morphs that contributed to the ranking of that document) - this is relevant
only for content-based searching - or permitting highlighting of a portion of the displayed retrieved document for
play back.
[0063] The top N retrieved items are presented to the user in a compact form. This lets the user visually review the
retrieved items for further action. Generally, it includes all the gathered information about the retrieved document,
including a portion of the text of the document. When one of the retrieved items is selected for perusal of the audio
or video, the media handler component is called upon to locate the media file, advance to the specified start time,
decompress the stream (if required), and then initialise the media player with the first frame of the audio or video.
The VCR-like interface permits the user to "play" the retrieved video from start to finish or to stop and advance at
any juncture.
[0064] Several improvements can be made within the context of our approach to content-based retrieval of audio. The
current set of documents derived from the speech recognition output can be augmented by including the next-best
guesses for each word or phrase from the recogniser. This information can be used for weighting the index terms, query
expansion, and retrieval. Also, better recognition accuracy can be had by detecting segments with music or mostly
noise, so that only pure speech is indexed for retrieval. One limitation with the current approach to audio-indexing
is the finite coverage of the vocabulary used in the speech recogniser. Words such as proper nouns and abbreviations
that are important from an information retrieval standpoint are often found missing in the vocabulary and hence in the
recognised transcripts. One method to overcome this limitation is to complement the speech recogniser with a
wordspotter for the out of vocabulary words. For this approach to be practical, however, one has to have the ability
to detect spoken words in large amounts of speech at speeds many times faster than real-time.

Claims 

1. A method for retrieving audio information from one or more audio sources, said method comprising the steps of:

receiving a user query specifying at least one content and one speaker constraint; and
comparing said user query with a content index and a speaker index of said audio source to identify audio
information satisfying said user query.

2. The method of claim 1, wherein said content index and said speaker index are time-stamped and said comparing
step further comprises the step of comparing the start and end times of the document segments in both the content
and speaker domains.

3. The method of any preceding claim, wherein said content index includes the frequency of each word in said audio
source.

4. The method of any preceding claim, wherein said content index includes the inverse document frequency (IDF) of 
each word in said audio source. 

5. The method of any preceding claim, wherein said content index includes the length of said audio source. 

6. The method of any preceding claim, wherein said content index includes a set of chain linked pointers to each
document containing a given word.

7. The method of any preceding claim, wherein said speaker index includes a score indicating the distance from an 
enrolled speaker model to the audio test segment. 

8. The method of any preceding claim, wherein said speaker index includes the start and end times of each audio 
segment. 

9. The method of any preceding claim, wherein said speaker index includes a label identifying the speaker associated 
with the segment. 

10. The method of any preceding claim, wherein said comparing step further comprises the step of comparing
documents satisfying the content-based query with documents satisfying the speaker-based query to identify relevant
documents.

11. The method of any preceding claim, further comprising the step of transcribing and indexing said audio source to
create said content index and said speaker index.

12. The method of claim 11, wherein said step of creating said speaker index comprises the steps of automatically
detecting turns in said audio source and assigning a speaker label to each of said turns.

13. The method of any preceding claim, further comprising the step of returning at least a portion of said identified 
audio information to a user. 

14. The method of any preceding claim, further comprising the step of assigning a combined score to each segment of 
said identified audio information and returning at least a portion of said identified audio information in a ranked-list. 

15. The method of claim 14, wherein said combined score evaluates the extent of the overlap between the content and
speaker domains.

16. The method of claim 14, wherein said combined score evaluates a ranked document score ranking the
content-based information retrieval.

17. The method of claim 14, wherein said combined score evaluates a speaker segment score measuring the proximity
between a speaker segment and enrolled speaker information.

18. The method of any preceding claim, wherein said speaker constraint includes the identity of a speaker. 

19. The method of any preceding claim, wherein said content constraint includes one or more keywords. 

20. An audio retrieval system for retrieving audio information from one or more audio sources, comprising: 

a memory that stores a content index and a speaker index of said audio source and computer-readable code; 
and 

a processor operatively coupled to said memory, said processor configured to implement said
computer-readable code, said computer-readable code configured to:

receive a user query specifying one or more words and the identity of a speaker; and 

combine the results of a content-based and a speaker-based audio information retrieval to provide references
to said audio source based on the audio content and the speaker identity.

21. The audio retrieval system of claim 20, wherein said content index and said speaker index are time-stamped and
said processor is further configured to compare the start and end times of the document segments in both the
content and speaker domains.

22. The audio retrieval system of claim 20 or 21, wherein said content index includes the frequency of each word in 
said audio source. 

23. The audio retrieval system of any claim 20 to 22, wherein said content index includes the inverse document
frequency (IDF) of each word in said audio source.

24. The audio retrieval system of any claim 20 to 23, wherein said speaker index includes a score indicating the
distance from an enrolled speaker model to the audio test segment.

25. The audio retrieval system of any claim 20 to 24, wherein said speaker index includes a label identifying the
speaker associated with the segment.

26. The audio retrieval system of any claim 20 to 25, wherein said processor is further configured to compare
documents satisfying the content-based query with documents satisfying the speaker-based query to identify relevant
documents.

27. The audio retrieval system of any claim 20 to 26, wherein said processor is further configured to transcribe and 
index said audio source to create said content index and said speaker index. 

28. The audio retrieval system of any claim 20 to 27, wherein said processor is further configured to assign a combined
score to each segment of said identified audio information and return at least a portion of said identified audio
information in a ranked-list.

29. The audio retrieval system of claim 28, wherein said combined score evaluates the extent of the overlap between
the content and speaker domains.

30. The audio retrieval system of claim 29, wherein said combined score evaluates a ranked document score ranking
the content-based information retrieval.

31. The audio retrieval system of claim 29, wherein said combined score evaluates a speaker segment score
measuring the proximity between a speaker segment and enrolled speaker information.

32. A computer program comprising computer program code means adapted to perform all the steps of any claim 1-19 
when said program is run on a computer. 

33. The computer program of claim 32 embodied on a computer readable medium.